Buying and selling used phones and tablets used to happen on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and an IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to rising demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to consumers and businesses looking to save money on a purchase, and the used device market offers plenty of other benefits. Used and refurbished devices can be sold with warranties and can be insured with proof of purchase. Third-party vendors and platforms such as Verizon and Amazon provide attractive offers on refurbished devices. Maximizing device longevity through second-hand trade also reduces environmental impact by supporting recycling and cutting waste. The COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
#manipulating and analyzing data
import numpy as np
import pandas as pd
#data visualization
import seaborn as sns
import matplotlib.pyplot as plt
#splitting data into train and test
from sklearn.model_selection import train_test_split
#making linear regression model
import statsmodels.api as sm
#checking performance of model
from sklearn.metrics import mean_absolute_error, mean_squared_error
#assigning devices dataframe to the data stored in the csv file
devices = pd.read_csv('used_device_data.csv')
#viewing first 5 rows of data
devices.head()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.715100 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |
#viewing last 5 rows of data
devices.tail()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3449 | Asus | Android | 15.34 | yes | no | NaN | 8.0 | 64.0 | 6.0 | 5000.0 | 190.0 | 2019 | 232 | 4.492337 | 6.483872 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.0 | 8.0 | 128.0 | 8.0 | 4000.0 | 200.0 | 2018 | 541 | 5.037732 | 6.251538 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 3.0 | 4000.0 | 165.0 | 2020 | 201 | 4.357350 | 4.528829 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 2.0 | 4000.0 | 160.0 | 2020 | 149 | 4.349762 | 4.624188 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.0 | 5.0 | 16.0 | 2.0 | 4000.0 | 168.0 | 2020 | 176 | 4.132122 | 4.279994 |
Each row in the dataset holds information for one device.
#viewing the number of rows and columns in the data
devices.shape
(3454, 15)
There are 3,454 rows and 15 columns in the devices dataframe.
#viewing information about each column and its datatype
devices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   brand_name             3454 non-null   object 
 1   os                     3454 non-null   object 
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object 
 4   5g                     3454 non-null   object 
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64  
 12  days_used              3454 non-null   int64  
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
#viewing statistical summary of all of the columns in the devices dataframe
#using include='all' to make sure the summary also includes categorical variables
devices.describe(include='all')
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3454 | 3454 | 3454.000000 | 3454 | 3454 | 3275.000000 | 3452.000000 | 3450.000000 | 3450.000000 | 3448.000000 | 3447.000000 | 3454.000000 | 3454.000000 | 3454.000000 | 3454.000000 |
| unique | 34 | 4 | NaN | 2 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Others | Android | NaN | yes | no | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 502 | 3214 | NaN | 2335 | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | 13.713115 | NaN | NaN | 9.460208 | 6.554229 | 54.573099 | 4.036122 | 3133.402697 | 182.751871 | 2015.965258 | 674.869716 | 4.364712 | 5.233107 |
| std | NaN | NaN | 3.805280 | NaN | NaN | 4.815461 | 6.970372 | 84.972371 | 1.365105 | 1299.682844 | 88.413228 | 2.298455 | 248.580166 | 0.588914 | 0.683637 |
| min | NaN | NaN | 5.080000 | NaN | NaN | 0.080000 | 0.000000 | 0.010000 | 0.020000 | 500.000000 | 69.000000 | 2013.000000 | 91.000000 | 1.536867 | 2.901422 |
| 25% | NaN | NaN | 12.700000 | NaN | NaN | 5.000000 | 2.000000 | 16.000000 | 4.000000 | 2100.000000 | 142.000000 | 2014.000000 | 533.500000 | 4.033931 | 4.790342 |
| 50% | NaN | NaN | 12.830000 | NaN | NaN | 8.000000 | 5.000000 | 32.000000 | 4.000000 | 3000.000000 | 160.000000 | 2015.500000 | 690.500000 | 4.405133 | 5.245892 |
| 75% | NaN | NaN | 15.340000 | NaN | NaN | 13.000000 | 8.000000 | 64.000000 | 4.000000 | 4000.000000 | 185.000000 | 2018.000000 | 868.750000 | 4.755700 | 5.673718 |
| max | NaN | NaN | 30.710000 | NaN | NaN | 48.000000 | 32.000000 | 1024.000000 | 12.000000 | 9720.000000 | 855.000000 | 2020.000000 | 1094.000000 | 6.619433 | 7.847841 |
#viewing the statistical summary for numerical columns only
devices.describe()
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3454.000000 | 3275.000000 | 3452.000000 | 3450.000000 | 3450.000000 | 3448.000000 | 3447.000000 | 3454.000000 | 3454.000000 | 3454.000000 | 3454.000000 |
| mean | 13.713115 | 9.460208 | 6.554229 | 54.573099 | 4.036122 | 3133.402697 | 182.751871 | 2015.965258 | 674.869716 | 4.364712 | 5.233107 |
| std | 3.805280 | 4.815461 | 6.970372 | 84.972371 | 1.365105 | 1299.682844 | 88.413228 | 2.298455 | 248.580166 | 0.588914 | 0.683637 |
| min | 5.080000 | 0.080000 | 0.000000 | 0.010000 | 0.020000 | 500.000000 | 69.000000 | 2013.000000 | 91.000000 | 1.536867 | 2.901422 |
| 25% | 12.700000 | 5.000000 | 2.000000 | 16.000000 | 4.000000 | 2100.000000 | 142.000000 | 2014.000000 | 533.500000 | 4.033931 | 4.790342 |
| 50% | 12.830000 | 8.000000 | 5.000000 | 32.000000 | 4.000000 | 3000.000000 | 160.000000 | 2015.500000 | 690.500000 | 4.405133 | 5.245892 |
| 75% | 15.340000 | 13.000000 | 8.000000 | 64.000000 | 4.000000 | 4000.000000 | 185.000000 | 2018.000000 | 868.750000 | 4.755700 | 5.673718 |
| max | 30.710000 | 48.000000 | 32.000000 | 1024.000000 | 12.000000 | 9720.000000 | 855.000000 | 2020.000000 | 1094.000000 | 6.619433 | 7.847841 |
#viewing missing values in the devices dataframe
devices.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
#viewing duplicate values in the devices dataframe
devices.duplicated().sum()
0
There are no duplicate values.
#creating a histogram for the brand_name column to examine its distribution; color is set to thistle
sns.histplot(data=devices, x='brand_name', color='thistle') #data from devices is used to plot brand_name on x-axis
plt.xticks(rotation=90, size=9); #x-axis labels are rotated 90 degrees and their size is set to 9
plt.title('brand_name') #title is set to brand_name (column name)
plt.show(); #displaying graph
#creating histogram for each of the other columns in the devices dataframe
for i in devices.columns: #for loop for each column in devices
if (i=='brand_name'): #condition for if the column is brand_name
pass #skipping brand_name because it is already plotted above
else: #condition for all of the other columns/color is set to thistle
sns.histplot(data=devices, x=i, color='thistle') #using data from devices to plot the column on x-axis
plt.title(i) #title of graph is set to column name
plt.show(); #displaying graph
#assigning all_columns variable for all the variables except brand_name
all_columns = devices.columns.drop('brand_name')
#using one pair grid for all columns would be overwhelming
#first few columns used in the grid are assigned a new variable first_var
first_var = devices.loc[:,['screen_size','main_camera_mp','selfie_camera_mp','int_memory','ram']]
#the other columns are assigned last_var variable
last_var = devices.loc[:,['battery','weight','release_year','days_used']]
#creating a pair grid using data from the devices dataframe
#y-variables in the grid are all columns except brand_name
#x-variables are columns in first_var
#.map() function used with sns.scatterplot to make scatterplots for the columns in the pair grid
sns.PairGrid(data=devices, y_vars=all_columns, x_vars=first_var).map(sns.scatterplot)
plt.show(); #displaying the pair grid
#creating pair grid with data from devices dataframe
#y_variables are all columns except brand_name
#x-variables are columns in last_var
#.map() function used with sns.scatterplot to make scatterplot for the columns in the pair grid
sns.PairGrid(data=devices, y_vars=all_columns, x_vars=last_var).map(sns.scatterplot)
plt.show(); #displaying pair grid
#assigning a variable called devices_numerical to only the numerical columns in the devices dataframe
devices_numerical = devices.select_dtypes(np.number)
#creating heatmap with numerical columns in devices to show correlation in shades of blue color
#the number labels are shown with annot=True and the range is from -1 to 1
#labels are limited to 2 decimal spaces
sns.heatmap(devices_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap
1) What does the distribution of normalized used device prices look like?
#creating histogram using data from devices/plotting normalized_used_price on x-axis
#density curve is shown using kde=True, and color is set to light steel blue
sns.histplot(data=devices, x='normalized_used_price', kde=True, color='lightsteelblue')
plt.title('Distribution of Normalized Used Device Prices') #setting title of histogram
plt.xlabel('Normalized Used Device Price') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram
2) What percentage of the used device market is dominated by Android devices?
#creating histogram using data from devices with os on the x-axis
#color is set to light steel blue
sns.histplot(data=devices, x='os',color='lightsteelblue')
plt.title('Operating System for Used Devices') #setting title of histogram
plt.xlabel('Operating System') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram
#calculating the percent of android devices
#finding the number of rows where the os column in devices is android using shape[0]
#and dividing that number by the total number of rows in devices
#multiplying that number by 100 to get the percentage
#assigning the percentage to a variable called percent_android
percent_android = (devices[devices['os'] == 'Android'].shape[0]/devices['os'].shape[0])*100
#printing the value of percent_android
print(percent_android)
93.05153445280834
93% of the used devices have an Android operating system.
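The same share can be read off directly with `value_counts(normalize=True)`. A minimal sketch, using a hypothetical toy frame in place of the devices data:

```python
import pandas as pd

# hypothetical mini-frame standing in for the devices data
toy = pd.DataFrame({'os': ['Android', 'Android', 'iOS', 'Android']})

# value_counts(normalize=True) returns category shares directly,
# avoiding the manual shape[0] division
os_share = toy['os'].value_counts(normalize=True) * 100
print(os_share)
```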
3) The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
#creating scatterplot using data from devices/plotting brand_name on x-axis and ram on y-axis
#color is set to light steel blue
sns.scatterplot(data=devices, x='brand_name', y='ram',color='lightsteelblue')
plt.xticks(rotation=90,size=9) #x-axis labels are rotated 90 degrees and size is set to 9 to avoid overcrowding
plt.title('RAM vs. Brand Name') #setting title of scatterplot
plt.xlabel('Brand Name') #setting title of x-axis
plt.ylabel('RAM') #setting title of y-axis
plt.show(); #displaying scatterplot
#creating histogram using data from devices and placing ram on x-axis
#color is set to light steel blue
sns.histplot(data=devices, x='ram', color='lightsteelblue')
plt.title('Distribution of RAM') #setting title of histogram
plt.xlabel('RAM') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram
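Beyond the plots, a grouped median states the brand comparison numerically. A minimal sketch, using a hypothetical toy frame in place of the devices dataframe:

```python
import pandas as pd

# hypothetical toy frame in place of the devices dataframe
toy = pd.DataFrame({
    'brand_name': ['OnePlus', 'OnePlus', 'Alcatel', 'Alcatel'],
    'ram': [8.0, 12.0, 2.0, 3.0],
})

# median RAM per brand, sorted so the most generous brands come first
ram_by_brand = toy.groupby('brand_name')['ram'].median().sort_values(ascending=False)
print(ram_by_brand)
```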
4) A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
#creating scatterplot using data from devices/plotting battery on x-axis and weight on y-axis
#color is set to light steel blue
sns.scatterplot(data=devices,x='battery', y='weight', color='lightsteelblue')
plt.title('Weight vs. Battery') #setting title of scatterplot
plt.xlabel('Battery') #setting title of x-axis
plt.ylabel('Weight') #setting title of y-axis
plt.show(); #displaying scatterplot
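To answer the question numerically as well, one could filter to devices with batteries above 4500 mAh and summarize their weight. A minimal sketch on a hypothetical toy frame (not the devices data):

```python
import pandas as pd

# hypothetical toy frame in place of the devices dataframe
toy = pd.DataFrame({
    'battery': [3000.0, 5000.0, 6000.0, 4000.0],
    'weight':  [150.0, 210.0, 230.0, 180.0],
})

# devices with large batteries (more than 4500 mAh), per the question
large_battery = toy[toy['battery'] > 4500]
print(large_battery['weight'].describe())
```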
5) Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
#converting 6 inches to centimeters, since screen_size is given in cm
#1 in = 2.54 cm, so 6 in = 2.54*6 cm
six_inches = 2.54*6
#assigning a variable called large_screens to devices where screen_size is greater than 6 inches
large_screens = devices[devices['screen_size'] > six_inches]
#displaying the number of screen_size values there are in the large_screens variable using .count() function
#organizing the number of screen_size values by brand_name using .groupby(by=['brand_name']) function
#organizing these by descending order using .sort_values(ascending=False) function
large_screens.groupby(by=['brand_name'])['screen_size'].count().sort_values(ascending=False)
brand_name
Huawei       149
Samsung      119
Others        99
Vivo          80
Honor         72
Oppo          70
Xiaomi        69
Lenovo        69
LG            59
Motorola      42
Asus          41
Realme        40
Alcatel       26
Apple         24
Acer          19
ZTE           17
Meizu         17
OnePlus       16
Nokia         15
Sony          12
Infinix       10
HTC            7
Micromax       7
Google         4
Gionee         3
XOLO           3
Coolpad        3
Karbonn        2
Panasonic      2
Spice          2
Microsoft      1
Name: screen_size, dtype: int64
6) A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
#assigning variable called large_selfie_camera to devices where selfie_camera_mp is larger than 8
large_selfie_camera = devices[devices['selfie_camera_mp'] > 8]
#finding the total number of selfie_camera_mp values in large_selfie_camera using .count() function
#organizing those values by brand_name using .groupby(by='brand_name') function
#organizing those values into descending order using .sort_values(ascending=False) function
large_selfie_camera.groupby(by = ['brand_name'])['selfie_camera_mp'].count().sort_values(ascending=False)
brand_name
Huawei        87
Vivo          78
Oppo          75
Xiaomi        63
Samsung       57
Honor         41
Others        34
LG            32
Motorola      26
Meizu         24
ZTE           20
HTC           20
OnePlus       18
Realme        18
Sony          14
Lenovo        14
Nokia         10
Asus           6
Infinix        4
Gionee         4
Coolpad        3
Micromax       2
Panasonic      2
BlackBerry     2
Acer           1
Name: selfie_camera_mp, dtype: int64
7) Which attributes are highly correlated with the normalized price of a used device?
#creating heatmap showing the correlation of the previously created variable devices_numerical (only numerical columns in devices dataframe)
#labels are shown with annot=true, and correlation values are from -1 to 1
#labels are limited to 2 decimal points and color is in shades of blue
sns.heatmap(devices_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap
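To read the answer off numerically rather than from the heatmap, one could sort the correlations with the target by absolute value. A minimal sketch on a hypothetical toy frame standing in for devices_numerical:

```python
import pandas as pd

# hypothetical toy frame standing in for devices_numerical
toy = pd.DataFrame({
    'normalized_used_price': [4.0, 4.5, 5.0, 5.5],
    'screen_size': [12.0, 13.0, 15.0, 16.0],
    'days_used': [900, 700, 500, 300],
})

# correlations with the target, strongest first by absolute value
corr_with_price = (
    toy.corr()['normalized_used_price']
    .drop('normalized_used_price')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```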
#viewing missing values in devices
devices.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
#assigning a new variable to the dataframe copy because values need to be filled in without changing the original dataset
devices1 = devices.copy()
#columns whose null values will be filled with the median of the column when grouped by brand_name
cols_to_impute = ['main_camera_mp', 'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight']
for col in cols_to_impute: #looping over each column that has missing values
    devices1[col] = devices1[col].fillna(value=devices1.groupby(by=['brand_name'])[col].transform('median'))
#viewing the sum of null values in each column of devices1
devices1.isnull().sum()
brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64
#locating the remaining null values in the main_camera_mp column of devices1
devices1.loc[devices1['main_camera_mp'].isnull()]
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Infinix | Android | 17.32 | yes | no | NaN | 8.0 | 32.0 | 2.0 | 6000.0 | 209.0 | 2020 | 245 | 4.282068 | 4.597642 |
| 60 | Infinix | Android | 15.39 | yes | no | NaN | 8.0 | 64.0 | 4.0 | 5000.0 | 185.0 | 2020 | 173 | 4.363608 | 4.711780 |
| 61 | Infinix | Android | 15.39 | yes | no | NaN | 8.0 | 32.0 | 2.0 | 5000.0 | 185.0 | 2020 | 256 | 4.181439 | 4.505460 |
| 62 | Infinix | Android | 15.39 | yes | no | NaN | 16.0 | 32.0 | 3.0 | 4000.0 | 178.0 | 2019 | 316 | 4.555244 | 4.602166 |
| 63 | Infinix | Android | 15.29 | yes | no | NaN | 16.0 | 32.0 | 2.0 | 4000.0 | 165.0 | 2019 | 468 | 4.416670 | 4.871297 |
| 278 | Infinix | Android | 17.32 | yes | no | NaN | 8.0 | 32.0 | 2.0 | 6000.0 | 209.0 | 2020 | 320 | 4.405133 | 4.605370 |
| 279 | Infinix | Android | 15.39 | yes | no | NaN | 8.0 | 64.0 | 4.0 | 5000.0 | 185.0 | 2020 | 173 | 4.495913 | 4.702115 |
| 280 | Infinix | Android | 15.39 | yes | no | NaN | 8.0 | 32.0 | 2.0 | 5000.0 | 185.0 | 2020 | 329 | 4.370713 | 4.487287 |
| 281 | Infinix | Android | 15.39 | yes | no | NaN | 16.0 | 32.0 | 3.0 | 4000.0 | 178.0 | 2019 | 356 | 4.417997 | 4.605970 |
| 282 | Infinix | Android | 15.29 | yes | no | NaN | 16.0 | 32.0 | 2.0 | 4000.0 | 165.0 | 2019 | 497 | 4.423289 | 4.866072 |
#filling in the remaining null values of the main_camera_mp column in devices1 using median values of the column when organized by os
devices1['main_camera_mp'] = devices1['main_camera_mp'].fillna(value=devices1.groupby(by=['os'])['main_camera_mp'].transform('median'))
#viewing the sum of null values in each column of devices1
devices1.isnull().sum()
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64
Feature engineering is not needed for this dataset.
#creating a variable called devices1_numerical to contain only the numerical columns of devices1
devices1_numerical = devices1.select_dtypes(np.number)
#making boxplot for each numerical variable
for i in devices1_numerical.columns: #using for loop for each column in devices1_numerical
sns.boxplot(data=devices1_numerical, x=i) #making a boxplot for each variable using data from devices1_numerical
plt.title(i) #setting title of boxplot
plt.show(); #displaying boxplot
#the target variable is normalized_used_price, so that is y
#X will hold all of the other variables
#X is all the columns of devices1 with the normalized_used_price column dropped
X = devices1.drop(['normalized_used_price'], axis=1)
#y is the normalized_used_price column of the devices1 dataframe
y = devices1['normalized_used_price']
#viewing X
X
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.715100 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.519018 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.884631 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.630961 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.947837 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.0 | 8.0 | 64.0 | 6.0 | 5000.0 | 190.0 | 2019 | 232 | 6.483872 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.0 | 8.0 | 128.0 | 8.0 | 4000.0 | 200.0 | 2018 | 541 | 6.251538 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 3.0 | 4000.0 | 165.0 | 2020 | 201 | 4.528829 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 2.0 | 4000.0 | 160.0 | 2020 | 149 | 4.624188 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.0 | 5.0 | 16.0 | 2.0 | 4000.0 | 168.0 | 2020 | 176 | 4.279994 |
3454 rows × 14 columns
#viewing y
y
0 4.307572
1 5.162097
2 5.111084
3 5.135387
4 4.389995
...
3449 4.492337
3450 5.037732
3451 4.357350
3452 4.349762
3453 4.132122
Name: normalized_used_price, Length: 3454, dtype: float64
#using sm.add_constant(X) function to add an intercept
X = sm.add_constant(X)
#adding dummy variables for the 4 categorical (object) variables in devices1 (brand_name, os, 4g, and 5g)
#first variable in the dummy variables is dropped for each object variable using drop_first=True
X = pd.get_dummies(X, columns = X.select_dtypes(include=['object']).columns.tolist(), drop_first=True)
#viewing X to make sure the dummy variables have been created
X
| | const | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | ... | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 14.50 | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.0 | 17.30 | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1.0 | 16.69 | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1.0 | 25.50 | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 1.0 | 15.32 | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | 1.0 | 15.34 | 13.0 | 8.0 | 64.0 | 6.0 | 5000.0 | 190.0 | 2019 | 232 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3450 | 1.0 | 15.24 | 13.0 | 8.0 | 128.0 | 8.0 | 4000.0 | 200.0 | 2018 | 541 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3451 | 1.0 | 15.80 | 13.0 | 5.0 | 32.0 | 3.0 | 4000.0 | 165.0 | 2020 | 201 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3452 | 1.0 | 15.80 | 13.0 | 5.0 | 32.0 | 2.0 | 4000.0 | 160.0 | 2020 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3453 | 1.0 | 12.83 | 13.0 | 5.0 | 16.0 | 2.0 | 4000.0 | 168.0 | 2020 | 176 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3454 rows × 49 columns
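As a quick illustration of what drop_first=True does, the toy frame below (hypothetical, not the devices data) encodes a single os column; the first category becomes the baseline and gets no dummy column:

```python
import pandas as pd

# hypothetical toy frame with one categorical column
toy = pd.DataFrame({'os': ['Android', 'iOS', 'Windows']})

# drop_first=True keeps k-1 dummies for k categories, so the dropped
# level (the first in sorted order) acts as the reference category and
# perfect collinearity with the intercept is avoided
dummies = pd.get_dummies(toy, columns=['os'], drop_first=True)
print(dummies.columns.tolist())
```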
#dividing the data into train and test sets
#test_size is set to 0.3 to get a 70:30 train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
#viewing the number of rows in x_train
x_train.shape[0]
2417
#viewing the number of rows in x_test
x_test.shape[0]
1037
The columns main_camera_mp, selfie_camera_mp, int_memory, ram, battery, and weight were modified when their missing values were imputed. It is best to verify that this did not alter the distribution of the data.
#the columns where null values have been replaced are all numerical
#viewing histograms for the columns in the previously created variable devices1_numerical
#color is set to thistle
for i in devices1_numerical.columns: #for loop for each column in devices1_numerical
sns.histplot(data=devices1_numerical, x=i, color='thistle') #using data from devices1_numerical to plot variable on x-axis
plt.title(i) #setting title of histogram
plt.show(); #displaying histogram
There are no notable changes in the distributions of the columns whose missing values were imputed.
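A quick numeric complement to the histograms: compare summary statistics before and after the fill. The sketch below uses a hypothetical toy frame with one missing weight value, not the devices data:

```python
import numpy as np
import pandas as pd

# hypothetical 'before' frame with one missing value and its imputed copy
before = pd.DataFrame({'weight': [150.0, 160.0, np.nan, 180.0]})
after = before.copy()
after['weight'] = after['weight'].fillna(after['weight'].median())

# summary statistics should barely move after a median fill
print(before['weight'].describe())
print(after['weight'].describe())
```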
#viewing any changes in correlation by using a heatmap
#using annot=True to display labels and range is from -1 to 1
#decimals are limited to 2 spaces, and color is in shades of blue
sns.heatmap(devices1_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap
#building the linear regression model as olsmodel
olsmodel = sm.OLS(y_train, x_train).fit()
#printing the summary of the linear regression model
print(olsmodel.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 268.8
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:08 Log-Likelihood: 124.15
No. Observations: 2417 AIC: -150.3
Df Residuals: 2368 BIC: 133.4
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -46.4646 9.197 -5.052 0.000 -64.500 -28.430
screen_size 0.0244 0.003 7.156 0.000 0.018 0.031
main_camera_mp 0.0208 0.002 13.848 0.000 0.018 0.024
selfie_camera_mp 0.0135 0.001 11.996 0.000 0.011 0.016
int_memory 0.0001 6.97e-05 1.664 0.096 -2.07e-05 0.000
ram 0.0232 0.005 4.515 0.000 0.013 0.033
battery -1.686e-05 7.27e-06 -2.318 0.021 -3.11e-05 -2.6e-06
weight 0.0010 0.000 7.488 0.000 0.001 0.001
release_year 0.0236 0.005 5.189 0.000 0.015 0.033
days_used 4.196e-05 3.09e-05 1.360 0.174 -1.85e-05 0.000
normalized_new_price 0.4309 0.012 35.134 0.000 0.407 0.455
brand_name_Alcatel 0.0154 0.048 0.324 0.746 -0.078 0.109
brand_name_Apple -0.0032 0.147 -0.021 0.983 -0.292 0.285
brand_name_Asus 0.0150 0.048 0.313 0.754 -0.079 0.109
brand_name_BlackBerry -0.0297 0.070 -0.423 0.672 -0.167 0.108
brand_name_Celkon -0.0463 0.066 -0.699 0.484 -0.176 0.084
brand_name_Coolpad 0.0209 0.073 0.286 0.775 -0.122 0.164
brand_name_Gionee 0.0447 0.058 0.775 0.438 -0.068 0.158
brand_name_Google -0.0327 0.085 -0.386 0.700 -0.199 0.133
brand_name_HTC -0.0131 0.048 -0.271 0.786 -0.108 0.081
brand_name_Honor 0.0316 0.049 0.642 0.521 -0.065 0.128
brand_name_Huawei -0.0022 0.044 -0.049 0.961 -0.089 0.085
brand_name_Infinix 0.0801 0.093 0.859 0.390 -0.103 0.263
brand_name_Karbonn 0.0943 0.067 1.406 0.160 -0.037 0.226
brand_name_LG -0.0132 0.045 -0.292 0.771 -0.102 0.076
brand_name_Lava 0.0332 0.062 0.533 0.594 -0.089 0.155
brand_name_Lenovo 0.0453 0.045 1.003 0.316 -0.043 0.134
brand_name_Meizu -0.0130 0.056 -0.232 0.817 -0.123 0.097
brand_name_Micromax -0.0337 0.048 -0.704 0.481 -0.128 0.060
brand_name_Microsoft 0.0947 0.088 1.072 0.284 -0.079 0.268
brand_name_Motorola -0.0113 0.050 -0.228 0.820 -0.109 0.086
brand_name_Nokia 0.0705 0.052 1.362 0.173 -0.031 0.172
brand_name_OnePlus 0.0707 0.077 0.913 0.361 -0.081 0.222
brand_name_Oppo 0.0124 0.048 0.259 0.796 -0.081 0.106
brand_name_Others -0.0080 0.042 -0.191 0.849 -0.091 0.074
brand_name_Panasonic 0.0562 0.056 1.006 0.314 -0.053 0.166
brand_name_Realme 0.0319 0.062 0.517 0.605 -0.089 0.153
brand_name_Samsung -0.0314 0.043 -0.726 0.468 -0.116 0.053
brand_name_Sony -0.0616 0.050 -1.221 0.222 -0.161 0.037
brand_name_Spice -0.0148 0.063 -0.234 0.815 -0.139 0.109
brand_name_Vivo -0.0155 0.048 -0.320 0.749 -0.111 0.080
brand_name_XOLO 0.0151 0.055 0.276 0.783 -0.092 0.123
brand_name_Xiaomi 0.0868 0.048 1.804 0.071 -0.008 0.181
brand_name_ZTE -0.0058 0.047 -0.122 0.903 -0.099 0.087
os_Others -0.0519 0.033 -1.585 0.113 -0.116 0.012
os_Windows -0.0202 0.045 -0.448 0.654 -0.109 0.068
os_iOS -0.0669 0.146 -0.457 0.648 -0.354 0.220
4g_yes 0.0530 0.016 3.341 0.001 0.022 0.084
5g_yes -0.0721 0.031 -2.292 0.022 -0.134 -0.010
==============================================================================
Omnibus: 223.220 Durbin-Watson: 1.911
Prob(Omnibus): 0.000 Jarque-Bera (JB): 422.514
Skew: -0.618 Prob(JB): 1.79e-92
Kurtosis: 4.633 Cond. No. 7.70e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.7e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
#assigning the R-squared of the model to r_squared variable
r_squared = olsmodel.rsquared
#assigning the Adjusted R-squared of the model to adjusted_r_squared variable
adjusted_r_squared = olsmodel.rsquared_adj
#importing error metrics from scikit-learn (if not already imported earlier in the notebook)
from sklearn.metrics import mean_absolute_error, mean_squared_error
#assigning Mean Absolute Error for training set to training_mae variable
training_mae = mean_absolute_error(y_train, olsmodel.predict(x_train))
#assigning Mean Absolute Error for testing set to testing_mae variable
testing_mae = mean_absolute_error(y_test, olsmodel.predict(x_test))
#assigning Mean Square Error for training set to training_mse variable
training_mse = mean_squared_error(y_train, olsmodel.predict(x_train))
#assigning Mean Square Error for testing set to testing_mse variable
testing_mse = mean_squared_error(y_test, olsmodel.predict(x_test))
#assigning Root Mean Square Error for training set to training_rmse variable
training_rmse = np.sqrt(training_mse)
#assigning Root Mean Square Error for testing set to testing_rmse variable
testing_rmse = np.sqrt(testing_mse)
#creating a dataframe with all of the above performance metrics
#creating small lists of performance metric labels and the corresponding variables from above to make one whole list called performance
performance = [['R-squared', r_squared], ['Adjusted R-squared', adjusted_r_squared], ['Training MAE', training_mae],
['Testing MAE', testing_mae], ['Training MSE', training_mse],['Testing MSE', testing_mse],
['Training RMSE', training_rmse], ['Testing RMSE', testing_rmse]]
#creating a dataframe from the performance list and adding titles to the columns
model_performance = pd.DataFrame(performance, columns = ['Performance Metric', 'Value'])
model_performance #dataframe with the metrics and corresponding values
| | Performance Metric | Value |
|---|---|---|
| 0 | R-squared | 0.844924 |
| 1 | Adjusted R-squared | 0.841780 |
| 2 | Training MAE | 0.180302 |
| 3 | Testing MAE | 0.184868 |
| 4 | Training MSE | 0.052834 |
| 5 | Testing MSE | 0.056874 |
| 6 | Training RMSE | 0.229856 |
| 7 | Testing RMSE | 0.238482 |
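The metric bookkeeping above can be bundled into a reusable helper. The sketch below is hypothetical (not part of the notebook) and computes the same error metrics with plain NumPy instead of scikit-learn, returning them in the same DataFrame layout:

```python
import numpy as np
import pandas as pd

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and RMSE for a set of predictions as a DataFrame."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))     # mean absolute error
    mse = np.mean(errors ** 2)        # mean squared error
    rmse = np.sqrt(mse)               # root mean squared error
    return pd.DataFrame(
        [["MAE", mae], ["MSE", mse], ["RMSE", rmse]],
        columns=["Performance Metric", "Value"],
    )

# toy example with made-up values, purely for illustration
metrics = regression_metrics([3.0, 2.0, 4.0], [2.5, 2.0, 5.0])
```

In practice one would call this twice, once with `(y_train, olsmodel.predict(x_train))` and once with `(y_test, olsmodel.predict(x_test))`, to reproduce the train/test rows of the table above.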
#importing the variance_inflation_factor function for calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
#creating a dataframe called vif
#one column called Variables holds the columns in x_train
#one column called Values holds the variance inflation factor for the corresponding columns
#using a for loop to calculate VIF with the variance_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i)
for i in range(len(x_train.columns))]
#to avoid displaying scientific notation for large numbers, the format is set to 6 decimal spaces
pd.options.display.float_format = '{:.6f}'.format
vif #displaying vif dataframe
| | Variables | Values |
|---|---|---|
| 0 | const | 3791081.976721 |
| 1 | screen_size | 7.677118 |
| 2 | main_camera_mp | 2.285811 |
| 3 | selfie_camera_mp | 2.810716 |
| 4 | int_memory | 1.364057 |
| 5 | ram | 2.255246 |
| 6 | battery | 4.081715 |
| 7 | weight | 6.397183 |
| 8 | release_year | 4.898360 |
| 9 | days_used | 2.660002 |
| 10 | normalized_new_price | 3.119569 |
| 11 | brand_name_Alcatel | 3.405674 |
| 12 | brand_name_Apple | 13.054832 |
| 13 | brand_name_Asus | 3.332003 |
| 14 | brand_name_BlackBerry | 1.632250 |
| 15 | brand_name_Celkon | 1.774820 |
| 16 | brand_name_Coolpad | 1.467981 |
| 17 | brand_name_Gionee | 1.951247 |
| 18 | brand_name_Google | 1.321771 |
| 19 | brand_name_HTC | 3.410252 |
| 20 | brand_name_Honor | 3.340621 |
| 21 | brand_name_Huawei | 5.983857 |
| 22 | brand_name_Infinix | 1.285635 |
| 23 | brand_name_Karbonn | 1.573683 |
| 24 | brand_name_LG | 4.849589 |
| 25 | brand_name_Lava | 1.711317 |
| 26 | brand_name_Lenovo | 4.558847 |
| 27 | brand_name_Meizu | 2.179607 |
| 28 | brand_name_Micromax | 3.363518 |
| 29 | brand_name_Microsoft | 1.869558 |
| 30 | brand_name_Motorola | 3.274455 |
| 31 | brand_name_Nokia | 3.473140 |
| 32 | brand_name_OnePlus | 1.437047 |
| 33 | brand_name_Oppo | 3.971065 |
| 34 | brand_name_Others | 9.710921 |
| 35 | brand_name_Panasonic | 2.105711 |
| 36 | brand_name_Realme | 1.946675 |
| 37 | brand_name_Samsung | 7.539832 |
| 38 | brand_name_Sony | 2.943127 |
| 39 | brand_name_Spice | 1.688868 |
| 40 | brand_name_Vivo | 3.651320 |
| 41 | brand_name_XOLO | 2.138074 |
| 42 | brand_name_Xiaomi | 3.719678 |
| 43 | brand_name_ZTE | 3.797527 |
| 44 | os_Others | 1.854134 |
| 45 | os_Windows | 1.595291 |
| 46 | os_iOS | 11.780766 |
| 47 | 4g_yes | 2.468374 |
| 48 | 5g_yes | 1.811042 |
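The drop-and-recheck cycle that follows (drop the highest-VIF column, refit, recompute VIF) can be automated. This is a hypothetical sketch on synthetic data, not the notebook's code; the function `drop_high_vif` and the toy columns `x1`/`x2`/`x3` are my own names:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the non-constant column with the highest VIF
    until every remaining predictor's VIF is at or below the threshold."""
    X = X.copy()
    while True:
        predictors = [c for c in X.columns if c != "const"]
        vifs = pd.Series(
            [variance_inflation_factor(X.values, X.columns.get_loc(c))
             for c in predictors],
            index=predictors,
        )
        if vifs.max() <= threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])

# synthetic example: x3 is nearly a copy of x1, so the pair is collinear
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2,
                  "x3": x1 + rng.normal(scale=0.01, size=200)})
reduced = drop_high_vif(X)  # one of x1/x3 is dropped; const and x2 survive
```

Dropping one column at a time matters: removing a variable changes every other variable's VIF, as the tables in this section show.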
#the highest VIF is seen in the brand_name_Apple column
#creating new set called x_train1 by dropping brand_name_Apple column from x_train set
x_train1 = x_train.drop(['brand_name_Apple'],axis = 1)
olsmodel1 = sm.OLS(y_train, x_train1).fit() #building a linear regression model called olsmodel1 using x_train1 with y_train
print('R-squared', olsmodel1.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel1.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449234815864929
Adjusted R-squared 0.8418468263034896
brand_name_Apple has the highest VIF among the variables, and dropping it does not have much of an impact on the R-squared and Adjusted R-squared.
#removing the brand_name_Apple column from original x_train
x_train = x_train.drop(['brand_name_Apple'], axis=1)
#removing brand_name_Apple column from x_test as well
x_test = x_test.drop(['brand_name_Apple'], axis=1)
#building a new model called olsmodel2 for the modified x_train with y_train
olsmodel2 = sm.OLS(y_train, x_train).fit()
print(olsmodel2.summary()) #printing the summary of olsmodel2
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 274.6
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:09 Log-Likelihood: 124.15
No. Observations: 2417 AIC: -152.3
Df Residuals: 2369 BIC: 125.6
Df Model: 47
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -46.4564 9.187 -5.057 0.000 -64.472 -28.441
screen_size 0.0244 0.003 7.172 0.000 0.018 0.031
main_camera_mp 0.0208 0.002 13.852 0.000 0.018 0.024
selfie_camera_mp 0.0135 0.001 12.046 0.000 0.011 0.016
int_memory 0.0001 6.97e-05 1.664 0.096 -2.07e-05 0.000
ram 0.0232 0.005 4.522 0.000 0.013 0.033
battery -1.686e-05 7.27e-06 -2.320 0.020 -3.11e-05 -2.61e-06
weight 0.0010 0.000 7.492 0.000 0.001 0.001
release_year 0.0236 0.005 5.194 0.000 0.015 0.033
days_used 4.197e-05 3.08e-05 1.361 0.174 -1.85e-05 0.000
normalized_new_price 0.4309 0.012 35.238 0.000 0.407 0.455
brand_name_Alcatel 0.0157 0.046 0.338 0.735 -0.075 0.107
brand_name_Asus 0.0153 0.047 0.328 0.743 -0.076 0.107
brand_name_BlackBerry -0.0294 0.069 -0.428 0.668 -0.164 0.105
brand_name_Celkon -0.0461 0.065 -0.705 0.481 -0.174 0.082
brand_name_Coolpad 0.0211 0.072 0.293 0.770 -0.120 0.162
brand_name_Gionee 0.0450 0.057 0.792 0.428 -0.066 0.156
brand_name_Google -0.0324 0.084 -0.387 0.699 -0.196 0.132
brand_name_HTC -0.0129 0.047 -0.274 0.784 -0.105 0.079
brand_name_Honor 0.0318 0.048 0.666 0.506 -0.062 0.126
brand_name_Huawei -0.0019 0.043 -0.045 0.964 -0.086 0.082
brand_name_Infinix 0.0803 0.093 0.868 0.385 -0.101 0.262
brand_name_Karbonn 0.0945 0.066 1.422 0.155 -0.036 0.225
brand_name_LG -0.0130 0.044 -0.295 0.768 -0.099 0.073
brand_name_Lava 0.0334 0.062 0.542 0.588 -0.087 0.154
brand_name_Lenovo 0.0456 0.044 1.039 0.299 -0.040 0.132
brand_name_Meizu -0.0127 0.055 -0.232 0.817 -0.120 0.095
brand_name_Micromax -0.0335 0.047 -0.716 0.474 -0.125 0.058
brand_name_Microsoft 0.0949 0.087 1.085 0.278 -0.077 0.266
brand_name_Motorola -0.0111 0.048 -0.229 0.819 -0.106 0.084
brand_name_Nokia 0.0708 0.050 1.412 0.158 -0.028 0.169
brand_name_OnePlus 0.0709 0.076 0.928 0.354 -0.079 0.221
brand_name_Oppo 0.0126 0.046 0.271 0.786 -0.078 0.104
brand_name_Others -0.0078 0.041 -0.192 0.848 -0.088 0.072
brand_name_Panasonic 0.0564 0.055 1.029 0.304 -0.051 0.164
brand_name_Realme 0.0321 0.060 0.531 0.596 -0.087 0.151
brand_name_Samsung -0.0311 0.042 -0.748 0.455 -0.113 0.050
brand_name_Sony -0.0614 0.049 -1.247 0.212 -0.158 0.035
brand_name_Spice -0.0146 0.063 -0.234 0.815 -0.137 0.108
brand_name_Vivo -0.0152 0.047 -0.324 0.746 -0.108 0.077
brand_name_XOLO 0.0153 0.054 0.284 0.776 -0.090 0.121
brand_name_Xiaomi 0.0870 0.047 1.862 0.063 -0.005 0.179
brand_name_ZTE -0.0056 0.046 -0.121 0.904 -0.096 0.085
os_Others -0.0521 0.032 -1.645 0.100 -0.114 0.010
os_Windows -0.0202 0.045 -0.449 0.653 -0.109 0.068
os_iOS -0.0698 0.059 -1.185 0.236 -0.185 0.046
4g_yes 0.0530 0.016 3.342 0.001 0.022 0.084
5g_yes -0.0721 0.031 -2.296 0.022 -0.134 -0.011
==============================================================================
Omnibus: 223.213 Durbin-Watson: 1.911
Prob(Omnibus): 0.000 Jarque-Bera (JB): 422.547
Skew: -0.618 Prob(JB): 1.76e-92
Kurtosis: 4.633 Cond. No. 7.69e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
#rechecking the variance inflation factor for the columns in x_train
#creating a dataframe called vif
#one column called Variables holds the columns in x_train
#one column called Values holds the variance inflation factor for the corresponding columns
#using a for loop to calculate VIF with the variance_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i)
for i in range(len(x_train.columns))]
vif #displaying vif dataframe
| | Variables | Values |
|---|---|---|
| 0 | const | 3784495.420947 |
| 1 | screen_size | 7.643504 |
| 2 | main_camera_mp | 2.285201 |
| 3 | selfie_camera_mp | 2.789591 |
| 4 | int_memory | 1.364046 |
| 5 | ram | 2.247171 |
| 6 | battery | 4.079641 |
| 7 | weight | 6.394451 |
| 8 | release_year | 4.889730 |
| 9 | days_used | 2.659521 |
| 10 | normalized_new_price | 3.102357 |
| 11 | brand_name_Alcatel | 3.230621 |
| 12 | brand_name_Asus | 3.144923 |
| 13 | brand_name_BlackBerry | 1.561068 |
| 14 | brand_name_Celkon | 1.731859 |
| 15 | brand_name_Coolpad | 1.436792 |
| 16 | brand_name_Gionee | 1.886305 |
| 17 | brand_name_Google | 1.293140 |
| 18 | brand_name_HTC | 3.240385 |
| 19 | brand_name_Honor | 3.159809 |
| 20 | brand_name_Huawei | 5.581499 |
| 21 | brand_name_Infinix | 1.266884 |
| 22 | brand_name_Karbonn | 1.544215 |
| 23 | brand_name_LG | 4.564980 |
| 24 | brand_name_Lava | 1.670716 |
| 25 | brand_name_Lenovo | 4.291535 |
| 26 | brand_name_Meizu | 2.092794 |
| 27 | brand_name_Micromax | 3.214160 |
| 28 | brand_name_Microsoft | 1.835387 |
| 29 | brand_name_Motorola | 3.109598 |
| 30 | brand_name_Nokia | 3.258328 |
| 31 | brand_name_OnePlus | 1.402472 |
| 32 | brand_name_Oppo | 3.762609 |
| 33 | brand_name_Others | 9.075017 |
| 34 | brand_name_Panasonic | 2.030879 |
| 35 | brand_name_Realme | 1.878667 |
| 36 | brand_name_Samsung | 6.991243 |
| 37 | brand_name_Sony | 2.799041 |
| 38 | brand_name_Spice | 1.655390 |
| 39 | brand_name_Vivo | 3.447542 |
| 40 | brand_name_XOLO | 2.069721 |
| 41 | brand_name_Xiaomi | 3.512985 |
| 42 | brand_name_ZTE | 3.604237 |
| 43 | os_Others | 1.734615 |
| 44 | os_Windows | 1.593031 |
| 45 | os_iOS | 1.908364 |
| 46 | 4g_yes | 2.467374 |
| 47 | 5g_yes | 1.802632 |
#the highest VIF value is seen in the brand_name_Others column
#creating new set called x_train2 by dropping brand_name_Others column from x_train set
x_train2 = x_train.drop(['brand_name_Others'],axis = 1)
olsmodel3 = sm.OLS(y_train, x_train2).fit() #building a linear regression model called olsmodel3 using x_train2 with y_train
print('R-squared', olsmodel3.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel3.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449210795185332
Adjusted R-squared 0.8419111089100322
brand_name_Others has the next highest VIF when compared to the other variables, and dropping it does not have much of an impact on R-squared and Adjusted R-squared.
#removing the brand_name_Others column from original x_train
x_train = x_train.drop(['brand_name_Others'], axis=1)
#removing brand_name_Others column from x_test as well
x_test = x_test.drop(['brand_name_Others'], axis=1)
#building a new model called olsmodel4 for the modified x_train with y_train
olsmodel4 = sm.OLS(y_train, x_train).fit()
print(olsmodel4.summary()) #printing the summary of olsmodel4
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 280.7
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:10 Log-Likelihood: 124.13
No. Observations: 2417 AIC: -154.3
Df Residuals: 2370 BIC: 117.9
Df Model: 46
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -46.4460 9.185 -5.057 0.000 -64.458 -28.434
screen_size 0.0244 0.003 7.227 0.000 0.018 0.031
main_camera_mp 0.0208 0.002 13.855 0.000 0.018 0.024
selfie_camera_mp 0.0135 0.001 12.047 0.000 0.011 0.016
int_memory 0.0001 6.97e-05 1.661 0.097 -2.09e-05 0.000
ram 0.0232 0.005 4.534 0.000 0.013 0.033
battery -1.685e-05 7.27e-06 -2.319 0.020 -3.11e-05 -2.6e-06
weight 0.0010 0.000 7.496 0.000 0.001 0.001
release_year 0.0236 0.005 5.193 0.000 0.015 0.033
days_used 4.203e-05 3.08e-05 1.363 0.173 -1.84e-05 0.000
normalized_new_price 0.4309 0.012 35.255 0.000 0.407 0.455
brand_name_Alcatel 0.0227 0.028 0.797 0.426 -0.033 0.078
brand_name_Asus 0.0223 0.029 0.775 0.439 -0.034 0.079
brand_name_BlackBerry -0.0224 0.058 -0.386 0.700 -0.137 0.092
brand_name_Celkon -0.0390 0.054 -0.724 0.469 -0.145 0.067
brand_name_Coolpad 0.0282 0.062 0.457 0.647 -0.093 0.149
brand_name_Gionee 0.0520 0.043 1.206 0.228 -0.033 0.137
brand_name_Google -0.0254 0.075 -0.337 0.736 -0.173 0.122
brand_name_HTC -0.0057 0.029 -0.200 0.842 -0.062 0.051
brand_name_Honor 0.0389 0.030 1.278 0.202 -0.021 0.099
brand_name_Huawei 0.0051 0.022 0.230 0.818 -0.039 0.049
brand_name_Infinix 0.0875 0.085 1.033 0.302 -0.079 0.254
brand_name_Karbonn 0.1015 0.055 1.833 0.067 -0.007 0.210
brand_name_LG -0.0059 0.024 -0.248 0.804 -0.053 0.041
brand_name_Lava 0.0405 0.049 0.821 0.412 -0.056 0.137
brand_name_Lenovo 0.0526 0.024 2.180 0.029 0.005 0.100
brand_name_Meizu -0.0056 0.040 -0.139 0.889 -0.085 0.074
brand_name_Micromax -0.0264 0.029 -0.915 0.360 -0.083 0.030
brand_name_Microsoft 0.1021 0.079 1.294 0.196 -0.053 0.257
brand_name_Motorola -0.0039 0.031 -0.128 0.898 -0.064 0.056
brand_name_Nokia 0.0779 0.034 2.301 0.021 0.011 0.144
brand_name_OnePlus 0.0780 0.067 1.162 0.245 -0.054 0.210
brand_name_Oppo 0.0197 0.028 0.702 0.483 -0.035 0.075
brand_name_Panasonic 0.0635 0.040 1.570 0.117 -0.016 0.143
brand_name_Realme 0.0392 0.048 0.823 0.410 -0.054 0.133
brand_name_Samsung -0.0241 0.020 -1.220 0.222 -0.063 0.015
brand_name_Sony -0.0543 0.032 -1.677 0.094 -0.118 0.009
brand_name_Spice -0.0076 0.051 -0.149 0.881 -0.107 0.092
brand_name_Vivo -0.0082 0.029 -0.280 0.779 -0.065 0.049
brand_name_XOLO 0.0223 0.040 0.564 0.573 -0.055 0.100
brand_name_Xiaomi 0.0941 0.029 3.293 0.001 0.038 0.150
brand_name_ZTE 0.0016 0.027 0.057 0.955 -0.052 0.055
os_Others -0.0516 0.032 -1.635 0.102 -0.114 0.010
os_Windows -0.0204 0.045 -0.453 0.650 -0.109 0.068
os_iOS -0.0628 0.046 -1.358 0.175 -0.154 0.028
4g_yes 0.0529 0.016 3.337 0.001 0.022 0.084
5g_yes -0.0722 0.031 -2.302 0.021 -0.134 -0.011
==============================================================================
Omnibus: 223.194 Durbin-Watson: 1.911
Prob(Omnibus): 0.000 Jarque-Bera (JB): 422.387
Skew: -0.618 Prob(JB): 1.90e-92
Kurtosis: 4.633 Cond. No. 7.69e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
#rechecking the variance inflation factor for the columns in x_train
#creating a dataframe called vif
#one column called Variables holds the columns in x_train
#one column called Values holds the variance inflation factor for the corresponding columns
#using a for loop to calculate VIF with the variance_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i)
for i in range(len(x_train.columns))]
vif #displaying vif dataframe
| | Variables | Values |
|---|---|---|
| 0 | const | 3784363.592994 |
| 1 | screen_size | 7.570153 |
| 2 | main_camera_mp | 2.283728 |
| 3 | selfie_camera_mp | 2.787818 |
| 4 | int_memory | 1.363394 |
| 5 | ram | 2.243844 |
| 6 | battery | 4.079417 |
| 7 | weight | 6.368547 |
| 8 | release_year | 4.889201 |
| 9 | days_used | 2.659263 |
| 10 | normalized_new_price | 3.101257 |
| 11 | brand_name_Alcatel | 1.217785 |
| 12 | brand_name_Asus | 1.200109 |
| 13 | brand_name_BlackBerry | 1.123993 |
| 14 | brand_name_Celkon | 1.174580 |
| 15 | brand_name_Coolpad | 1.053575 |
| 16 | brand_name_Gionee | 1.090929 |
| 17 | brand_name_Google | 1.044519 |
| 18 | brand_name_HTC | 1.220945 |
| 19 | brand_name_Honor | 1.280959 |
| 20 | brand_name_Huawei | 1.500803 |
| 21 | brand_name_Infinix | 1.061884 |
| 22 | brand_name_Karbonn | 1.073813 |
| 23 | brand_name_LG | 1.344151 |
| 24 | brand_name_Lava | 1.071000 |
| 25 | brand_name_Lenovo | 1.298072 |
| 26 | brand_name_Meizu | 1.134630 |
| 27 | brand_name_Micromax | 1.227730 |
| 28 | brand_name_Microsoft | 1.494129 |
| 29 | brand_name_Motorola | 1.256765 |
| 30 | brand_name_Nokia | 1.485057 |
| 31 | brand_name_OnePlus | 1.079844 |
| 32 | brand_name_Oppo | 1.371004 |
| 33 | brand_name_Panasonic | 1.106442 |
| 34 | brand_name_Realme | 1.167394 |
| 35 | brand_name_Samsung | 1.576652 |
| 36 | brand_name_Sony | 1.212002 |
| 37 | brand_name_Spice | 1.080742 |
| 38 | brand_name_Vivo | 1.324005 |
| 39 | brand_name_XOLO | 1.118852 |
| 40 | brand_name_Xiaomi | 1.313746 |
| 41 | brand_name_ZTE | 1.263080 |
| 42 | os_Others | 1.723014 |
| 43 | os_Windows | 1.592335 |
| 44 | os_iOS | 1.176653 |
| 45 | 4g_yes | 2.462358 |
| 46 | 5g_yes | 1.801670 |
#the highest VIF value is seen in the screen_size column
#creating new set called x_train3 by dropping screen_size column from x_train set
x_train3 = x_train.drop(['screen_size'],axis = 1)
olsmodel5 = sm.OLS(y_train, x_train3).fit() #building a linear regression model called olsmodel5 using x_train3 with y_train
print('R-squared', olsmodel5.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel5.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449210795185332
Adjusted R-squared 0.8419111089100322
screen_size has the next highest VIF, and dropping it does not have much of an effect on R-squared and Adjusted R-squared.
#removing the screen_size column from original x_train
x_train = x_train.drop(['screen_size'], axis=1)
#removing screen_size column from x_test as well
x_test = x_test.drop(['screen_size'], axis=1)
#building a new model called olsmodel6 for the modified x_train with y_train
olsmodel6 = sm.OLS(y_train, x_train).fit()
print(olsmodel6.summary()) #printing the summary of olsmodel6
#rechecking the variance inflation factor for the columns in x_train
#creating a dataframe called vif
#one column called Variables holds the columns in x_train
#one column called Values holds the variance inflation factor for the corresponding columns
#using a for loop to calculate VIF with the variance_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i)
for i in range(len(x_train.columns))]
vif #displaying vif dataframe
| | Variables | Values |
|---|---|---|
| 0 | const | 3652189.313723 |
| 1 | main_camera_mp | 2.280685 |
| 2 | selfie_camera_mp | 2.785809 |
| 3 | int_memory | 1.361090 |
| 4 | ram | 2.243688 |
| 5 | battery | 3.832812 |
| 6 | weight | 2.988550 |
| 7 | release_year | 4.714847 |
| 8 | days_used | 2.647808 |
| 9 | normalized_new_price | 3.054145 |
| 10 | brand_name_Alcatel | 1.202265 |
| 11 | brand_name_Asus | 1.198891 |
| 12 | brand_name_BlackBerry | 1.123884 |
| 13 | brand_name_Celkon | 1.172478 |
| 14 | brand_name_Coolpad | 1.052543 |
| 15 | brand_name_Gionee | 1.090087 |
| 16 | brand_name_Google | 1.044376 |
| 17 | brand_name_HTC | 1.220847 |
| 18 | brand_name_Honor | 1.265332 |
| 19 | brand_name_Huawei | 1.490147 |
| 20 | brand_name_Infinix | 1.061741 |
| 21 | brand_name_Karbonn | 1.068827 |
| 22 | brand_name_LG | 1.344073 |
| 23 | brand_name_Lava | 1.068851 |
| 24 | brand_name_Lenovo | 1.294022 |
| 25 | brand_name_Meizu | 1.134233 |
| 26 | brand_name_Micromax | 1.225923 |
| 27 | brand_name_Microsoft | 1.494091 |
| 28 | brand_name_Motorola | 1.256747 |
| 29 | brand_name_Nokia | 1.484503 |
| 30 | brand_name_OnePlus | 1.078405 |
| 31 | brand_name_Oppo | 1.368931 |
| 32 | brand_name_Panasonic | 1.104229 |
| 33 | brand_name_Realme | 1.167275 |
| 34 | brand_name_Samsung | 1.574092 |
| 35 | brand_name_Sony | 1.211511 |
| 36 | brand_name_Spice | 1.080580 |
| 37 | brand_name_Vivo | 1.316824 |
| 38 | brand_name_XOLO | 1.114832 |
| 39 | brand_name_Xiaomi | 1.312174 |
| 40 | brand_name_ZTE | 1.262301 |
| 41 | os_Others | 1.517099 |
| 42 | os_Windows | 1.592227 |
| 43 | os_iOS | 1.175537 |
| 44 | 4g_yes | 2.461295 |
| 45 | 5g_yes | 1.796667 |
All of the variables now have a VIF below 5 (ignoring the constant term, whose VIF is not meaningful).
#viewing the variables with high p-values, as they do not have a statistically significant effect on the prediction
#identifying variables that have a p-value greater than 0.05
olsmodel6.pvalues[olsmodel6.pvalues > .05]
int_memory               0.177078
battery                  0.580329
days_used                0.378412
brand_name_Alcatel       0.108414
brand_name_Asus          0.320032
brand_name_BlackBerry    0.755715
brand_name_Celkon        0.678724
brand_name_Coolpad       0.498662
brand_name_Gionee        0.320038
brand_name_Google        0.676405
brand_name_HTC           0.893951
brand_name_Huawei        0.404902
brand_name_Infinix       0.269453
brand_name_LG            0.848910
brand_name_Lava          0.257059
brand_name_Meizu         0.996814
brand_name_Micromax      0.527715
brand_name_Microsoft     0.188100
brand_name_Motorola      0.877901
brand_name_OnePlus       0.158124
brand_name_Oppo          0.330456
brand_name_Panasonic     0.060981
brand_name_Realme        0.375372
brand_name_Samsung       0.357625
brand_name_Sony          0.129866
brand_name_Spice         0.814073
brand_name_Vivo          0.802564
brand_name_XOLO          0.323316
brand_name_ZTE           0.815072
os_Windows               0.696827
os_iOS                   0.117809
dtype: float64
#dropping all the variables that have a p-value greater than 0.05 from x_train
x_train = x_train.drop(['int_memory', 'battery', 'days_used','os_Windows','os_iOS','brand_name_Alcatel', 'brand_name_Asus',
'brand_name_BlackBerry', 'brand_name_Celkon', 'brand_name_Coolpad',
'brand_name_Gionee', 'brand_name_Google', 'brand_name_HTC',
'brand_name_Huawei', 'brand_name_Infinix','brand_name_LG','brand_name_Lava',
'brand_name_Meizu', 'brand_name_Micromax', 'brand_name_Microsoft','brand_name_Motorola',
'brand_name_OnePlus','brand_name_Oppo','brand_name_Panasonic','brand_name_Realme','brand_name_Samsung',
'brand_name_Sony','brand_name_Spice','brand_name_Vivo', 'brand_name_XOLO','brand_name_ZTE'], axis=1)
#dropping those columns from x_test as well
x_test = x_test.drop(['int_memory', 'battery', 'days_used','os_Windows','os_iOS','brand_name_Alcatel', 'brand_name_Asus',
'brand_name_BlackBerry', 'brand_name_Celkon', 'brand_name_Coolpad',
'brand_name_Gionee', 'brand_name_Google', 'brand_name_HTC',
'brand_name_Huawei', 'brand_name_Infinix','brand_name_LG','brand_name_Lava',
'brand_name_Meizu', 'brand_name_Micromax', 'brand_name_Microsoft','brand_name_Motorola',
'brand_name_OnePlus','brand_name_Oppo','brand_name_Panasonic','brand_name_Realme','brand_name_Samsung',
'brand_name_Sony','brand_name_Spice','brand_name_Vivo', 'brand_name_XOLO','brand_name_ZTE'], axis=1)
#building a new model called olsmodel7 for the modified x_train with y_train
olsmodel7 = sm.OLS(y_train, x_train).fit()
print(olsmodel7.summary()) #printing the summary of olsmodel7
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.840
Model: OLS Adj. R-squared: 0.839
Method: Least Squares F-statistic: 897.6
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:11 Log-Likelihood: 82.834
No. Observations: 2417 AIC: -135.7
Df Residuals: 2402 BIC: -48.81
Df Model: 14
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -55.6229 6.880 -8.085 0.000 -69.114 -42.132
main_camera_mp 0.0212 0.001 15.112 0.000 0.018 0.024
selfie_camera_mp 0.0142 0.001 13.339 0.000 0.012 0.016
ram 0.0230 0.005 4.578 0.000 0.013 0.033
weight 0.0016 6.04e-05 27.250 0.000 0.002 0.002
release_year 0.0283 0.003 8.287 0.000 0.022 0.035
normalized_new_price 0.4351 0.011 40.043 0.000 0.414 0.456
brand_name_Honor 0.0526 0.028 1.897 0.058 -0.002 0.107
brand_name_Karbonn 0.1230 0.055 2.248 0.025 0.016 0.230
brand_name_Lenovo 0.0554 0.022 2.557 0.011 0.013 0.098
brand_name_Nokia 0.0626 0.031 2.032 0.042 0.002 0.123
brand_name_Xiaomi 0.0930 0.026 3.615 0.000 0.043 0.143
os_Others -0.1429 0.028 -5.047 0.000 -0.198 -0.087
4g_yes 0.0459 0.015 3.053 0.002 0.016 0.075
5g_yes -0.0673 0.031 -2.202 0.028 -0.127 -0.007
==============================================================================
Omnibus: 242.954 Durbin-Watson: 1.915
Prob(Omnibus): 0.000 Jarque-Bera (JB): 470.533
Skew: -0.656 Prob(JB): 6.68e-103
Kurtosis: 4.718 Cond. No. 2.92e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.92e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
#brand_name_Honor has a p-value greater than 0.05
#dropping brand_name_Honor from x_train
x_train = x_train.drop(['brand_name_Honor'], axis=1)
#removing brand_name_Honor column from x_test as well
x_test = x_test.drop(['brand_name_Honor'], axis=1)
#building a new model called olsmodel8 for the modified x_train with y_train
olsmodel8 = sm.OLS(y_train, x_train).fit()
print(olsmodel8.summary()) #printing the summary of olsmodel8
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 965.3
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:11 Log-Likelihood: 81.025
No. Observations: 2417 AIC: -134.0
Df Residuals: 2403 BIC: -52.99
Df Model: 13
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -56.5564 6.866 -8.237 0.000 -70.020 -43.093
main_camera_mp 0.0213 0.001 15.241 0.000 0.019 0.024
selfie_camera_mp 0.0143 0.001 13.404 0.000 0.012 0.016
ram 0.0230 0.005 4.571 0.000 0.013 0.033
weight 0.0016 6.04e-05 27.262 0.000 0.002 0.002
release_year 0.0287 0.003 8.441 0.000 0.022 0.035
normalized_new_price 0.4338 0.011 39.980 0.000 0.413 0.455
brand_name_Karbonn 0.1223 0.055 2.235 0.026 0.015 0.230
brand_name_Lenovo 0.0537 0.022 2.479 0.013 0.011 0.096
brand_name_Nokia 0.0609 0.031 1.977 0.048 0.000 0.121
brand_name_Xiaomi 0.0895 0.026 3.488 0.000 0.039 0.140
os_Others -0.1437 0.028 -5.077 0.000 -0.199 -0.088
4g_yes 0.0458 0.015 3.046 0.002 0.016 0.075
5g_yes -0.0653 0.031 -2.136 0.033 -0.125 -0.005
==============================================================================
Omnibus: 241.256 Durbin-Watson: 1.914
Prob(Omnibus): 0.000 Jarque-Bera (JB): 464.306
Skew: -0.654 Prob(JB): 1.50e-101
Kurtosis: 4.703 Cond. No. 2.91e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
All of the remaining p-values are below 0.05 after dropping the high p-value columns. The R-squared and Adjusted R-squared have decreased slightly but remain strong at about 0.839 and 0.838, respectively.
#creating a plot with Residuals vs. Fitted Values
#making a dataframe with the Residuals and Fitted Values
lin_ind = pd.DataFrame()
lin_ind['Residuals'] = olsmodel8.resid #adding a column called Residuals with residuals from olsmodel8
lin_ind['Fitted'] = olsmodel8.fittedvalues #adding a column called Fitted with fitted values from olsmodel8
#creating residual plot using data from lin_ind, placing fitted on x axis and residuals on y-axis
#lowess is set to True to display curved line through the points to better understand the shape
#color is set to steel blue
sns.residplot(data=lin_ind, x='Fitted', y='Residuals', color='steelblue', lowess=True)
plt.xlabel('Fitted Values') #setting title of x-axis
plt.ylabel('Residuals') #setting title of y-axis
plt.title('Residuals vs. Fitted Values') #setting title of residual plot
plt.show(); #displaying residual plot
#creating a histogram for Residuals to view their distribution
#kde=True used to view density curve
sns.histplot(data=lin_ind, x='Residuals', kde=True) #using data from lin_ind, plotting Residuals on x-axis
plt.title('Distribution of Residuals') #setting title of histogram
plt.xlabel('Residuals') #setting title of x-axis
plt.show(); #displaying histogram
Although the distribution is slightly skewed to the left, the visible bell curve suggests the residuals are approximately normally distributed.
#creating a Q-Q plot of residuals
#importing scipy.stats to create probability plot
import scipy.stats as stats
#creating probability plot using data from Residuals and distribution is set to norm
#plot is set to plt because matplotlib.pyplot is being used to plot graph points
stats.probplot(lin_ind['Residuals'], dist='norm', plot=plt)
plt.show(); #displaying probability plot
Although the tails deviate slightly, the majority of the points fall along a straight line.
#using Shapiro-Wilk test to check if Residuals are normally distributed
stats.shapiro(lin_ind['Residuals'])
ShapiroResult(statistic=0.9682212471961975, pvalue=1.0571945927045805e-22)
The p-value is far below 0.05, so the Shapiro-Wilk test rejects the null hypothesis of normality; with a sample this large, even the mild left skew seen in the histogram is enough to fail a formal test.
#using the goldfeldquandt test to examine homoscedasticity in Residuals
#importing statsmodels.stats.api and statsmodels.compat to carry out the goldfeldquandt test
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name=['F statistic', 'p-value'] #labeling the F statistic and p-value
test = sms.het_goldfeldquandt(lin_ind['Residuals'],x_train) #carrying out the test
lzip(name,test) #to display both F-statistic and p-value
[('F statistic', 1.0181008983065767), ('p-value', 0.37830265087249293)]
The p-value of about 0.378 is above 0.05, so we fail to reject the null hypothesis of homoscedasticity; the residuals show no evidence of non-constant variance.
#displaying the final model again
olsmodel8 = sm.OLS(y_train, x_train).fit()
print(olsmodel8.summary()) #printing the summary of olsmodel8
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 965.3
Date: Thu, 16 Nov 2023 Prob (F-statistic): 0.00
Time: 21:37:14 Log-Likelihood: 81.025
No. Observations: 2417 AIC: -134.0
Df Residuals: 2403 BIC: -52.99
Df Model: 13
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -56.5564 6.866 -8.237 0.000 -70.020 -43.093
main_camera_mp 0.0213 0.001 15.241 0.000 0.019 0.024
selfie_camera_mp 0.0143 0.001 13.404 0.000 0.012 0.016
ram 0.0230 0.005 4.571 0.000 0.013 0.033
weight 0.0016 6.04e-05 27.262 0.000 0.002 0.002
release_year 0.0287 0.003 8.441 0.000 0.022 0.035
normalized_new_price 0.4338 0.011 39.980 0.000 0.413 0.455
brand_name_Karbonn 0.1223 0.055 2.235 0.026 0.015 0.230
brand_name_Lenovo 0.0537 0.022 2.479 0.013 0.011 0.096
brand_name_Nokia 0.0609 0.031 1.977 0.048 0.000 0.121
brand_name_Xiaomi 0.0895 0.026 3.488 0.000 0.039 0.140
os_Others -0.1437 0.028 -5.077 0.000 -0.199 -0.088
4g_yes 0.0458 0.015 3.046 0.002 0.016 0.075
5g_yes -0.0653 0.031 -2.136 0.033 -0.125 -0.005
==============================================================================
Omnibus: 241.256 Durbin-Watson: 1.914
Prob(Omnibus): 0.000 Jarque-Bera (JB): 464.306
Skew: -0.654 Prob(JB): 1.50e-101
Kurtosis: 4.703 Cond. No. 2.91e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
#assigning the R-squared of the final model to the final_r_squared variable
final_r_squared = olsmodel8.rsquared
#assigning the Adjusted R-squared of the final model to the finaladj_r_squared variable
finaladj_r_squared = olsmodel8.rsquared_adj
#assigning Mean Absolute Error to final_mae_training variable for training set
final_mae_training = mean_absolute_error(y_train, olsmodel8.predict(x_train))
#assigning Mean Absolute Error to final_mae_testing variable for testing set
final_mae_testing = mean_absolute_error(y_test, olsmodel8.predict(x_test))
#assigning Mean Square Error to final_mse_training variable for training set
final_mse_training = mean_squared_error(y_train, olsmodel8.predict(x_train))
#assigning Mean Square Error to final_mse_testing variable for testing set
final_mse_testing = mean_squared_error(y_test, olsmodel8.predict(x_test))
#assigning Root Mean Square Error to final_rmse_training variable for training set
final_rmse_training = np.sqrt(final_mse_training)
#assigning Root Mean Square Error to final_rmse_testing variable for testing set
final_rmse_testing = np.sqrt(final_mse_testing)
#creating a dataframe with all of the above performance metrics
#creating small lists of performance metric labels and the corresponding variables from above to make one whole list called performance1
performance1 = [['R-squared', final_r_squared], ['Adjusted R-squared', finaladj_r_squared], ['Training MAE', final_mae_training],
['Testing MAE', final_mae_testing],['Training MSE', final_mse_training],['Testing MSE', final_mse_testing]
,['Training RMSE', final_rmse_training], ['Testing RMSE', final_rmse_testing]]
#creating a dataframe from the performance list and adding titles to the columns
final_model_performance = pd.DataFrame(performance1, columns = ['Performance Metric', 'Value'])
final_model_performance #dataframe with the metrics and corresponding values
|   | Performance Metric | Value |
|---|---|---|
| 0 | R-squared | 0.839290 |
| 1 | Adjusted R-squared | 0.838421 |
| 2 | Training MAE | 0.182984 |
| 3 | Testing MAE | 0.187646 |
| 4 | Training MSE | 0.054753 |
| 5 | Testing MSE | 0.058312 |
| 6 | Training RMSE | 0.233994 |
| 7 | Testing RMSE | 0.241480 |
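The training and testing metrics are close, which indicates the model is not overfitting. One more metric that could be added to this table (my addition, not in the original notebook) is MAPE, which expresses the average error as a percentage of the actual value; shown here on small illustrative arrays, not the ReCell data:

```python
#hedged sketch: mean absolute percentage error as an extra performance metric
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(mape([4.0, 5.0, 6.0], [4.4, 4.5, 6.0]))  #~6.67 percent
```

MAPE is easy to communicate to non-technical stakeholders, though it is undefined when any actual value is zero.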
all_variables = len(x_train.columns) #assigning the length of the columns in x_train to a variable called all_variables
print('normalized_used_price =') #first printing the 'normalized_used_price = '
for i in range(all_variables): #for loop over every column in x_train
    if i == 0: #if this is the first parameter (the constant/intercept)
        print(olsmodel8.params[i], '+') #print the intercept and +
    elif i < (all_variables-1): #else if this is not the last variable
        print(olsmodel8.params[i],'*',x_train.columns[i], '+') #print coefficient*column and +
    else: #else (last column)
        print(olsmodel8.params[i],'*',x_train.columns[i]) #print coefficient*column
normalized_used_price = -56.55642408210383 + 0.021335092174868883 * main_camera_mp + 0.014296909931711427 * selfie_camera_mp + 0.02295780622642629 * ram + 0.0016475080269465867 * weight + 0.028734621986767902 * release_year + 0.43383415709499906 * normalized_new_price + 0.12233049116897698 * brand_name_Karbonn + 0.05366715043017031 * brand_name_Lenovo + 0.060899305072687704 * brand_name_Nokia + 0.08952854580976108 * brand_name_Xiaomi + -0.1437416519758557 * os_Others + 0.045806756390413136 * 4g_yes + -0.06531508684395132 * 5g_yes
normalized_used_price = -56.55642408210383 + 0.021335092174868883 * main_camera_mp + 0.014296909931711427 * selfie_camera_mp + 0.02295780622642629 * ram + 0.0016475080269465867 * weight + 0.028734621986767902 * release_year + 0.43383415709499906 * normalized_new_price + 0.12233049116897698 * brand_name_Karbonn + 0.05366715043017031 * brand_name_Lenovo + 0.060899305072687704 * brand_name_Nokia + 0.08952854580976108 * brand_name_Xiaomi + -0.1437416519758557 * os_Others + 0.045806756390413136 * 4g_yes + -0.06531508684395132 * 5g_yes
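The equation could also be assembled as a single string by iterating over the fitted params Series. A minimal sketch, using a few hard-coded coefficients from the summary above rather than the live model object:

```python
#hedged sketch: building the regression equation string from a params Series
import pandas as pd

#a small subset of olsmodel8's coefficients, copied from the summary above
params = pd.Series(
    {"const": -56.5564, "main_camera_mp": 0.0213, "normalized_new_price": 0.4338}
)
terms = [f"{params.iloc[0]:.4f}"] + [
    f"{coef:.4f} * {name}" for name, coef in params.iloc[1:].items()
]
equation = "normalized_used_price = " + " + ".join(terms)
print(equation)
```

With a real fitted model, the same pattern works by passing olsmodel8.params in place of the hard-coded Series.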
When all of the other variables are constant:
The brands Karbonn, Lenovo, Nokia, and Xiaomi all have a positive effect on the price of used devices.
An operating system other than Android, iOS, or Windows can decrease the price of used devices by about 0.1437 units. The 'Others' category can be very broad; gathering more data on used devices with particular operating systems and analyzing their effect would help determine whether they truly lower the price of a used device compared with Android, iOS, and Windows.
Having 5G seems to decrease the used device price by about 0.0653 units, which is counterintuitive.
Qualities that allow used devices to be sold for a higher price, and can possibly help increase profit, include higher main and selfie camera resolution, more RAM, greater weight, a more recent release year, a higher price when new, and 4G support, all of which have positive coefficients in the model.
More research should be conducted on the effects of 5G and 4G on used device prices, and on why 5G seems to have a negative impact on the prices.